Method for decoding immersive video and method for encoding immersive video

ABSTRACT

A method of encoding an immersive video according to the present disclosure includes determining whether an input image is a first type, converting the input image into the first type when the input image is a second type different from the first type, encoding a converted image, and generating metadata for the encoded image.

TECHNICAL FIELD

The present disclosure relates to a method for encoding/decoding an immersive video which supports motion parallax for rotational and translational motion.

BACKGROUND ART

A virtual reality service is evolving in a direction of providing a service in which a sense of immersion and realism is maximized by generating an omnidirectional image in a form of an actual image or CG (Computer Graphics) and playing it on an HMD, a smartphone, etc. Currently, it is known that 6 Degrees of Freedom (DoF) should be supported to play a natural and immersive omnidirectional image through an HMD. For a 6DoF image, an image which is free in six directions including (1) left and right rotation, (2) top and bottom rotation, (3) left and right movement and (4) top and bottom movement should be provided through an HMD screen. However, most omnidirectional images based on an actual image support only rotary motion. Accordingly, studies on fields such as acquisition and reproduction technology of a 6DoF omnidirectional image are actively under way.

DISCLOSURE

Technical Problem

The present disclosure is to provide a method for encoding/decoding an immersive video in a unit of an object.

The present disclosure is to provide a method for encoding/decoding an immersive video with which heterogeneous images are combined.

The present disclosure is to provide a method for encoding/decoding attribute information on each of heterogeneous images.

The technical objects to be achieved by the present disclosure are not limited to the above-described technical objects, and other technical objects which are not described herein will be clearly understood by those skilled in the pertinent art from the following description.

Technical Solution

A method for encoding an immersive video according to the present disclosure includes determining whether an input image is a first type, converting the input image into the first type when the input image is a second type different from the first type, encoding a converted image, and generating metadata for the encoded image.

A method for decoding an immersive video according to the present disclosure includes acquiring a plurality of bitstreams through demultiplexing, decoding at least one of the plurality of bitstreams, and rendering an immersive video based on a decoded image and decoded metadata.

In a method for encoding/decoding an immersive video according to the present disclosure, the metadata may include video type information for the encoded/decoded image.

In a method for encoding/decoding an immersive video according to the present disclosure, the encoded/decoded image is an image for a predetermined object and the metadata may include dynamic information representing a dynamic characteristic of the object.

In a method for encoding/decoding an immersive video according to the present disclosure, the dynamic information may indicate whether the object is in a static state or in a dynamic state within a predetermined period.

In a method for encoding/decoding an immersive video according to the present disclosure, the predetermined period is a service period of the encoded/decoded image and the metadata may further include duration information representing the service period.

In a method for encoding/decoding an immersive video according to the present disclosure, the predetermined period is represented in a unit of GOP (Group of Pictures) and the dynamic information may be encoded/decoded per the predetermined period within a service period of the encoded/decoded image.

Technical Effects

According to the present disclosure, a method of encoding/decoding an immersive video in a unit of an object may be provided.

According to the present disclosure, a method of encoding/decoding an immersive video with which heterogeneous images are combined may be provided.

According to the present disclosure, a method of encoding/decoding attribute information on each of heterogeneous images may be provided.

Effects achievable by the present disclosure are not limited to the above-described effects, and other effects which are not described herein may be clearly understood by those skilled in the pertinent art from the following description.

DESCRIPTION OF DIAGRAMS

FIG. 1 is a block diagram of an immersive video processing device according to an embodiment of the present disclosure.

FIG. 2 is a block diagram of an immersive video output device according to an embodiment of the present disclosure.

FIG. 3 is a flow chart of an immersive video processing method.

FIG. 4 is a flow chart of an atlas encoding process.

FIG. 5 is a flow chart of an immersive video output method.

FIG. 6 is a block diagram of an immersive video processing device which supports object-based encoding.

FIG. 7 illustrates an immersive video with which heterogeneous videos are combined.

FIG. 8 is a flow chart which represents an encoding/decoding process of an immersive video shown in FIG. 7.

FIG. 9 illustrates an attribute value of each of input videos which may be represented by content description information.

FIGS. 10 and 11 represent a flow chart of an encoding/decoding process of an immersive video with which heterogeneous videos are combined according to an embodiment of the present disclosure.

BEST MODE

As the present disclosure may make various changes and have multiple embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present disclosure to a specific embodiment, and should be understood as including all changes, equivalents and substitutes included in the idea and technical scope of the present disclosure. A similar reference numeral in a drawing refers to a like or similar function across multiple aspects. A shape and a size, etc. of elements in a drawing may be exaggerated for a clearer description. The detailed description of exemplary embodiments described below refers to the accompanying drawings, which show specific embodiments as examples. These embodiments are described in detail so that those skilled in the pertinent art can implement an embodiment. It should be understood that the various embodiments are different from each other, but they do not need to be mutually exclusive. For example, a specific shape, structure and characteristic described herein may be implemented in another embodiment without departing from the scope and spirit of the present disclosure in connection with an embodiment. In addition, it should be understood that a position or an arrangement of an individual element in each disclosed embodiment may be changed without departing from the scope and spirit of an embodiment. Accordingly, the detailed description described below is not to be taken in a limited sense, and the scope of exemplary embodiments, if properly described, is limited only by the accompanying claims along with any scope equivalent to that claimed by those claims.

In the present disclosure, a term such as first, second, etc. may be used to describe a variety of elements, but the elements should not be limited by the terms. The terms are used only to distinguish one element from another element. For example, without departing from the scope of a right of the present disclosure, a first element may be referred to as a second element and likewise, a second element may also be referred to as a first element. The term and/or includes a combination of a plurality of relevant described items or any item of a plurality of relevant described items.

When an element in the present disclosure is referred to as being “connected” or “linked” to another element, it should be understood that it may be directly connected or linked to that other element, but another element may exist between them. Meanwhile, when an element is referred to as being “directly connected” or “directly linked” to another element, it should be understood that there is no other element between them.

As the construction units shown in an embodiment of the present disclosure are independently shown to represent different characteristic functions, it does not mean that each construction unit is composed of separate hardware or one piece of software. In other words, as each construction unit is enumerated as a separate construction unit for convenience of description, at least two construction units may be combined to form one construction unit or one construction unit may be divided into a plurality of construction units to perform a function, and an integrated embodiment and a separate embodiment of each construction unit are also included in the scope of a right of the present disclosure unless they depart from the essence of the present disclosure.

A term used in the present disclosure is just used to describe a specific embodiment, and is not intended to limit the present disclosure. A singular expression, unless the context clearly indicates otherwise, includes a plural expression. In the present disclosure, it should be understood that a term such as “include” or “have”, etc. is just intended to designate the presence of a feature, a number, a step, an operation, an element, a part or a combination thereof described in the present specification, and it does not exclude in advance the possibility of the presence or addition of one or more other features, numbers, steps, operations, elements, parts or their combinations. In other words, a description of “including” a specific configuration in the present disclosure does not exclude a configuration other than the corresponding configuration, and it means that an additional configuration may be included in the scope of the technical idea of the present disclosure or an embodiment of the present disclosure.

Some elements of the present disclosure are not necessary elements which perform an essential function in the present disclosure and may be optional elements just for improving performance. The present disclosure may be implemented by including only the construction units which are necessary to implement the essence of the present disclosure except for elements used just for performance improvement, and a structure including only necessary elements except for optional elements used just for performance improvement is also included in the scope of a right of the present disclosure.

Hereinafter, an embodiment of the present disclosure is described in detail by referring to the drawings. In describing an embodiment of the present specification, when it is determined that a detailed description of a relevant disclosed configuration or function may obscure the gist of the present specification, such a detailed description is omitted, and the same reference numeral is used for the same element in the drawings and an overlapping description of the same element is omitted.

An immersive video refers to an image whose viewport may change dynamically when a user's viewing position changes. In order to implement an immersive video, a plurality of input images are required. Each of the plurality of input images may be referred to as a source image or a view image. A different view index may be assigned to each view image.

An immersive video may be classified into a 3DoF (Degree of Freedom), 3DoF+, Windowed-6DoF or 6DoF type, etc. A 3DoF-based immersive video may be implemented by using only a texture image. On the other hand, in order to render an immersive video including depth information such as 3DoF+ or 6DoF, etc., a depth image as well as a texture image is required.

It is assumed that the embodiments described below are for immersive video processing including depth information such as 3DoF+ and/or 6DoF, etc. In addition, it is assumed that a view image is configured with a texture image and a depth image.

FIG. 1 is a block diagram of an immersive video processing device according to an embodiment of the present disclosure.

In reference to FIG. 1, an immersive video processing device according to the present disclosure may include a view optimizer 110, an atlas generation unit 120, a metadata generation unit 130, a video encoding unit 140 and a bitstream generation unit 150.

An immersive video processing device receives a plurality of pairs of images, a camera internal variable and a camera external variable as an input value to encode an immersive video. Here, a plurality of pairs of images include a texture image (Attribute component) and a depth image (Geometry component). Each pair may have a different view. Accordingly, a pair of input images may be referred to as a view image. Each of the view images may be distinguished by an index. In this case, an index assigned to each view image may be referred to as a view or a view index.

A camera internal variable includes a focal distance, a position of a principal point, etc. and a camera external variable includes a position, a direction, etc. of a camera. A camera internal variable and a camera external variable may be treated as a camera parameter or a view parameter.
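
As an illustration of how the camera variables and view images described above might be grouped in software, the following is a minimal Python sketch; the field names and the yaw/pitch/roll convention are assumptions for illustration, not the normative MIV syntax.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class CameraParameters:
    # Internal (intrinsic) variables
    focal_length: Tuple[float, float]        # (fx, fy)
    principal_point: Tuple[float, float]     # (cx, cy)
    # External (extrinsic) variables
    position: Tuple[float, float, float]     # camera position
    orientation: Tuple[float, float, float]  # camera direction (e.g., yaw, pitch, roll)

@dataclass
class ViewImage:
    view_index: int          # view index assigned to this view image
    texture: np.ndarray      # texture image (Attribute component)
    depth: np.ndarray        # depth image (Geometry component)
    camera: CameraParameters # camera (view) parameters
```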

A view optimizer 110 partitions view images into a plurality of groups. As view images are partitioned into a plurality of groups, independent encoding processing per group may be performed. In an example, view images filmed by N spatially consecutive cameras may be classified into one group. Thereby, view images whose depth information is relatively coherent may be put in one group, and accordingly, rendering quality may be improved.

In addition, by removing dependence on information between groups, a spatial random access service which performs rendering by selectively bringing only information of a region that a user is watching may be made available.

Partitioning view images into a plurality of groups is optional.

In addition, a view optimizer 110 may classify view images into a basic image and an additional image. A basic image represents a view image with a highest pruning priority which is not pruned, and an additional image represents a view image with a pruning priority lower than a basic image.

A view optimizer 110 may determine at least one of the view images as a basic image. A view image which is not selected as a basic image may be classified as an additional image.

A view optimizer 110 may determine a basic image by considering a view position of a view image. In an example, a view image whose view position is the center among a plurality of view images may be selected as a basic image.

Alternatively, a view optimizer 110 may select a basic image based on a camera parameter. Specifically, a view optimizer 110 may select a basic image based on at least one of a camera index, a priority between cameras, a position of a camera or whether it is a camera in a region of interest.

In an example, at least one of a view image with a smallest camera index, a view image with a largest camera index, a view image with the same camera index as a predefined value, a view image filmed by a camera with a highest priority, a view image filmed by a camera with a lowest priority, a view image filmed by a camera at a predefined position (e.g., a central position) or a view image filmed by a camera in a region of interest may be determined as a basic image.

Alternatively, a view optimizer 110 may determine a basic image based on the quality of view images. In an example, a view image with the highest quality among view images may be determined as a basic image.

Alternatively, a view optimizer 110 may determine a basic image by considering an overlapping data rate with other view images after inspecting a degree of data redundancy between view images. In an example, a view image with a highest overlapping data rate with other view images or a view image with a lowest overlapping data rate with other view images may be determined as a basic image.

A plurality of view images may also be configured as a basic image.

An atlas generation unit 120 performs pruning and generates a pruning mask. And, it extracts a patch by using a pruning mask and generates an atlas by combining a basic image and/or an extracted patch. When view images are partitioned into a plurality of groups, the process may be performed independently per group.

A generated atlas may be composed of a texture atlas and a depth atlas. A texture atlas represents a basic texture image and/or an image in which texture patches are combined and a depth atlas represents a basic depth image and/or an image in which depth patches are combined.

An atlas generation unit 120 may include a pruning unit 122, an aggregation unit 124 and a patch packing unit 126.

A pruning unit 122 performs pruning for an additional image based on a pruning priority. Specifically, pruning for an additional image may be performed by using a reference image with a higher pruning priority than the additional image.

A reference image includes a basic image. In addition, according to a pruning priority of an additional image, a reference image may further include another additional image.

Whether an additional image may be used as a reference image may be selectively determined. In an example, when an additional image is configured not to be used as a reference image, only a basic image may be configured as a reference image.

On the other hand, when an additional image is configured to be used as a reference image, a basic image and another additional image with a higher pruning priority than the additional image may be configured as a reference image.

Through a pruning process, redundant data between an additional image and a reference image may be removed. Specifically, through a warping process based on a depth image, data overlapping with a reference image may be removed from an additional image. In an example, when a depth value between an additional image and a reference image is compared and the difference is equal to or less than a threshold value, it may be determined that a corresponding pixel is redundant data.

As a result of pruning, a pruning mask including information on whether each pixel in an additional image is valid or invalid may be generated. A pruning mask may be a binary image which represents whether each pixel in an additional image is valid or invalid. In an example, in a pruning mask, a pixel determined as data overlapping with a reference image may have a value of 0 and a pixel determined as data not overlapping with a reference image may have a value of 1.
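
The pixel-level redundancy test described above can be sketched as follows. This is a minimal illustration assuming the reference depth has already been warped into the additional view (the warping step itself is omitted); the function name and array layout are illustrative.

```python
import numpy as np

def build_pruning_mask(additional_depth: np.ndarray,
                       reference_depth_warped: np.ndarray,
                       threshold: float) -> np.ndarray:
    """Binary pruning mask: 1 = valid (non-overlapping) pixel, 0 = redundant pixel.

    `reference_depth_warped` is assumed to be the reference depth already warped
    into the additional view; the warping step is not shown here.
    """
    diff = np.abs(additional_depth.astype(np.float64) -
                  reference_depth_warped.astype(np.float64))
    # Pixels whose depth difference is at or below the threshold are treated as
    # data that also exists in the reference image and are therefore pruned (0).
    return (diff > threshold).astype(np.uint8)
```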

While a non-overlapping region may have a non-square shape, a patch is limited to a square shape. Accordingly, a patch may include an invalid region as well as a valid region. Here, a valid region refers to a region composed of non-overlapping pixels between an additional image and a reference image. In other words, a valid region represents a region that includes data which is included in an additional image, but is not included in a reference image. An invalid region refers to a region composed of overlapping pixels between an additional image and a reference image. A pixel/data included in a valid region may be referred to as a valid pixel/valid data and a pixel/data included in an invalid region may be referred to as an invalid pixel/invalid data.

An aggregation unit 124 combines pruning masks generated in a unit of a frame in a unit of an intra-period.

In addition, an aggregation unit 124 may extract a patch from a combined pruning mask image through a clustering process. Specifically, a square region including valid data in a combined pruning mask image may be extracted as a patch. Regardless of a shape of a valid region, a patch is extracted in a square shape, so a patch extracted from a non-square valid region may include invalid data as well as valid data.
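
A minimal sketch of the aggregation and clustering steps follows; connected-component labeling via scipy stands in for the clustering process here, which is an assumption for illustration rather than the actual algorithm used by an encoder.

```python
import numpy as np
from scipy import ndimage

def aggregate_masks(masks_per_frame):
    """OR-combine the frame-level pruning masks of one intra-period into one mask."""
    combined = np.zeros_like(masks_per_frame[0])
    for mask in masks_per_frame:
        combined |= mask
    return combined

def extract_patches(combined_mask):
    """Return (x, y, width, height) rectangles, one per cluster of valid pixels.

    The extracted rectangle may contain invalid pixels as well as valid ones,
    as described in the text above.
    """
    labels, count = ndimage.label(combined_mask)
    patches = []
    for i in range(1, count + 1):
        ys, xs = np.nonzero(labels == i)
        patches.append((int(xs.min()), int(ys.min()),
                        int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)))
    return patches
```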

In this case, an aggregation unit 124 may repartition an L-shaped or C-shaped patch which reduces encoding efficiency. Here, an L-shaped patch represents that the distribution of a valid region is L-shaped and a C-shaped patch represents that the distribution of a valid region is C-shaped.

When the distribution of a valid region is L-shaped or C-shaped, the region occupied by an invalid region in a patch is relatively large. Accordingly, an L-shaped or C-shaped patch may be partitioned into a plurality of patches to improve encoding efficiency.

For an unpruned view image, a whole view image may be treated as one patch. Specifically, a whole 2D image which develops an unpruned view image in a predetermined projection format may be treated as one patch. A projection format may include at least one of an Equirectangular Projection Format (ERP), a Cube-map or a Perspective Projection Format.

Here, an unpruned view image refers to a basic image with a highest pruning priority. Alternatively, an additional image that has no data overlapping with a reference image and a basic image may also be defined as an unpruned view image. Alternatively, regardless of whether there is overlapping data with a reference image, an additional image arbitrarily excluded from a pruning target may also be defined as an unpruned view image. In other words, even an additional image that has data overlapping with a reference image may be defined as an unpruned view image.

A packing unit 126 packs a patch into a square image. In patch packing, deformation such as size transform, rotation, or flip, etc. of a patch may be accompanied. An image in which patches are packed may be defined as an atlas.

Specifically, a packing unit 126 may generate a texture atlas by packing a basic texture image and/or texture patches and may generate a depth atlas by packing a basic depth image and/or depth patches.

For a basic image, a whole basic image may be treated as one patch. In other words, a basic image may be packed into an atlas as it is. When a whole image is treated as one patch, a corresponding patch may be referred to as a complete image (complete view) or a complete patch.

The number of atlases generated by an atlas generation unit 120 may be determined based on at least one of an arrangement structure of a camera rig, accuracy of a depth map or the number of view images.

A metadata generation unit 130 generates metadata for image synthesis. Metadata may include at least one of camera-related data, pruning-related data, atlas-related data or patch-related data.

Pruning-related data includes information for determining a pruning priority between view images. In an example, at least one of a flag representing whether a view image is a root node or a flag representing whether a view image is a leaf node may be encoded. A root node represents a view image with a highest pruning priority (i.e., a basic image) and a leaf node represents a view image with a lowest pruning priority.

When a view image is not a root node, a parent node index may be additionally encoded. A parent node index may represent an image index of the view image which is its parent node.

Alternatively, when a view image is not a leaf node, a child node index may be additionally encoded. A child node index may represent an image index of the view image which is its child node.

Atlas-related data may include at least one of size information of an atlas, number information of atlases, priority information between atlases or a flag representing whether an atlas includes a complete image. A size of an atlas may include at least one of size information of a texture atlas and size information of a depth atlas. In this case, a flag representing whether a size of a depth atlas is the same as that of a texture atlas may be additionally encoded. When a size of a depth atlas is different from that of a texture atlas, reduction ratio information of a depth atlas (e.g., scaling-related information) may be additionally encoded. Atlas-related information may be included in a “View parameters list” item in a bitstream.

In an example, geometry_scale_enabled_flag, a syntax representing whether it is allowed to reduce a depth atlas, may be encoded/decoded. When a value of the syntax geometry_scale_enabled_flag is 0, it represents that it is not allowed to reduce a depth atlas. In this case, a depth atlas has the same size as a texture atlas.

When a value of the syntax geometry_scale_enabled_flag is 1, it represents that it is allowed to reduce a depth atlas. In this case, information for determining a reduction ratio of a depth atlas may be additionally encoded/decoded. In an example, geometry_scaling_factor_x, a syntax representing a horizontal directional reduction ratio of a depth atlas, and geometry_scaling_factor_y, a syntax representing a vertical directional reduction ratio of a depth atlas, may be additionally encoded/decoded.
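
A hedged sketch of how a decoder might read these syntax elements is shown below; the bitstream reader interface (read_flag, read_uint) and the assumed bit widths are placeholders, as the exact syntax layout and coding of these elements are defined by the specification.

```python
def parse_geometry_scaling(reader):
    """Return (scale_x, scale_y) reduction ratios for the depth (geometry) atlas.

    `reader.read_flag()` and `reader.read_uint(n)` are placeholder bitstream
    accessors; the real bit widths and entropy coding are an assumption here.
    """
    geometry_scale_enabled_flag = reader.read_flag()
    if not geometry_scale_enabled_flag:
        # Reduction is not allowed: the depth atlas has the same size as the texture atlas.
        return 1, 1
    geometry_scaling_factor_x = reader.read_uint(8)  # assumed width
    geometry_scaling_factor_y = reader.read_uint(8)  # assumed width
    return geometry_scaling_factor_x, geometry_scaling_factor_y
```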

An immersive video output device may restore a reduced depth atlas to its original size after decoding information on a reduction ratio of a depth atlas.

Patch-related data includes information for specifying a position and/or a size of a patch in an atlas image, a view image to which a patch belongs and a position and/or a size of a patch in a view image. In an example, at least one of position information representing a position of a patch in an atlas image or size information representing a size of a patch in an atlas image may be encoded. In addition, a source index for identifying a view image from which a patch is derived may be encoded. A source index represents an index of the view image which is the original source of a patch. In addition, position information representing a position corresponding to a patch in a view image or size information representing a size corresponding to a patch in a view image may be encoded. Patch-related information may be included in an “Atlas data” item in a bitstream.
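
For illustration, the patch-related data listed above could be grouped as in the following sketch; the field names are illustrative and do not correspond to normative syntax element names.

```python
from dataclasses import dataclass

@dataclass
class PatchData:
    # Position and size of the patch in the atlas image
    atlas_x: int
    atlas_y: int
    width: int
    height: int
    # Source index: the view image the patch is derived from
    source_view_index: int
    # Position of the corresponding region in the source view image
    view_x: int
    view_y: int
```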

An image encoding unit 140 encodes an atlas. When view images are classified into a plurality of groups, an atlas may be generated per group. Accordingly, image encoding may be performed independently per group.

An image encoding unit 140 may include a texture image encoding unit 142 encoding a texture atlas and a depth image encoding unit 144 encoding a depth atlas.

A bitstream generation unit 150 generates a bitstream based on encoded image data and metadata. A generated bitstream may be transmitted to an immersive video output device.

FIG. 2 is a block diagram of an immersive video output device according to an embodiment of the present disclosure.

In reference to FIG. 2, an immersive video output device according to the present disclosure may include a bitstream parsing unit 210, an image decoding unit 220, a metadata processing unit 230 and an image synthesizing unit 240.

A bitstream parsing unit 210 parses image data and metadata from a bitstream. Image data may include data of an encoded atlas. When a spatial random access service is supported, only a partial bitstream including a watching position of a user may be received.

An image decoding unit 220 decodes parsed image data. An image decoding unit 220 may include a texture image decoding unit 222 for decoding a texture atlas and a depth image decoding unit 224 for decoding a depth atlas.

A metadata processing unit 230 unformats parsed metadata.

Unformatted metadata may be used to synthesize a specific view image. In an example, when motion information of a user is input to an immersive video output device, a metadata processing unit 230 may determine an atlas necessary for image synthesis, the patches necessary for image synthesis, a position/a size of the patches in an atlas, and the like, in order to reproduce a viewport image according to a user's motion.

An image synthesizing unit 240 may dynamically synthesize a viewport image according to a user's motion. Specifically, an image synthesizing unit 240 may extract the patches required to synthesize a viewport image from an atlas by using the information determined in a metadata processing unit 230 according to a user's motion. Specifically, a viewport image may be generated by extracting, from an atlas, the patches which carry information of the view images required to synthesize the viewport image, and synthesizing the extracted patches.

FIGS. 3 and 5 show a flow chart of an immersive video processing method and an immersive video output method, respectively.

In the following flow charts, what is italicized or underlined represents input or output data for performing each step. In addition, in the following flow charts, an arrow represents the processing order of each step. In this case, steps without an arrow indicate that the temporal order between corresponding steps is not determined or that corresponding steps may be processed in parallel. In addition, it is also possible to process or output an immersive video in an order different from that shown in the following flow charts.

An immersive video processing device may receive at least one of a plurality of input images, a camera internal variable and a camera external variable, and evaluate depth map quality through the input data S301. Here, an input image may be configured with a pair of a texture image (Attribute component) and a depth image (Geometry component).

An immersive video processing device may classify input images into a plurality of groups based on positional proximity of a plurality of cameras S302. By classifying input images into a plurality of groups, pruning and encoding may be performed independently between adjacent cameras whose depth values are relatively coherent. In addition, through the process, a spatial random access service in which rendering is performed by using only information of a region a user is watching may be enabled.

But, the above-described S301 and S302 are just an optional procedure and this process is not necessarily performed.

When input images are classified into a plurality of groups, the procedures which will be described below may be performed independently per group.

An immersive video processing device may determine a pruning priority of view images S303. Specifically, view images may be classified into a basic image and an additional image and a pruning priority between additional images may be set.

Subsequently, based on a pruning priority, an atlas may be generated and a generated atlas may be encoded S304. A process of encoding atlases is shown in detail in FIG. 4.

Specifically, a pruning parameter (e.g., a pruning priority, etc.) may be determined S311 and based on a determined pruning parameter, pruning may be performed for view images S312. As a result of pruning, a basic image with a highest priority is maintained as it is. On the other hand, through pruning for an additional image, overlapping data between an additional image and a reference image is removed. Through a warping process based on a depth image, overlapping data between an additional image and a reference image may be removed.

As a result of pruning, a pruning mask may be generated. If a pruning mask is generated, pruning masks are combined in a unit of an intra-period S313. And, a patch may be extracted from a texture image and a depth image by using a combined pruning mask S314. Specifically, a combined pruning mask may be masked onto texture images and depth images to extract a patch.

In this case, for an unpruned view image (e.g., a basic image), a whole view image may be treated as one patch.

Subsequently, extracted patches may be packed S315 and an atlas may be generated S316. Specifically, a texture atlas and a depth atlas may be generated.

In addition, an immersive video processing device may determine a threshold value for determining whether a pixel is valid or invalid based on a depth atlas S317. In an example, a pixel whose value in an atlas is smaller than a threshold value may correspond to an invalid pixel and a pixel whose value is equal to or greater than a threshold value may correspond to a valid pixel. A threshold value may be determined in a unit of an image or may be determined in a unit of a patch.
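
The per-pixel validity test implied by S317 can be sketched as follows, under the assumption that the threshold has already been determined (per image or per patch); the function name is illustrative.

```python
import numpy as np

def occupancy_from_depth(depth_atlas: np.ndarray, threshold: float) -> np.ndarray:
    """1 where a depth-atlas value marks a valid pixel, 0 where it marks an invalid one.

    For a per-patch threshold, this would be applied separately to each patch region.
    """
    return (depth_atlas >= threshold).astype(np.uint8)
```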

For reducing the amount of data, a size of a depth atlas may be reduced by a specific ratio S318. When a size of a depth atlas is reduced, information on a reduction ratio of a depth atlas (e.g., a scaling factor) may be encoded. In an immersive video output device, a reduced depth atlas may be restored to its original size through a scaling factor and a size of a texture atlas.

Metadata generated in an atlas encoding process (e.g., a parameter set, a view parameter list or atlas data, etc.) and SEI (Supplemental Enhancement Information) are combined S305. In addition, a sub bitstream may be generated by encoding a texture atlas and a depth atlas respectively S306. And, a single bitstream may be generated by multiplexing encoded metadata and an encoded atlas S307.

An immersive video output device demultiplexes a bitstream received from an immersive video processing device S501. As a result, video data, i.e., atlas data and metadata may be extracted respectively S502 and S503.

An immersive video output device may restore an atlas based on parsed video data S504. In this case, when a depth atlas is reduced at a specific ratio, a depth atlas may be scaled to its original size by acquiring related information from metadata S505.

When a user's motion occurs, based on metadata, an atlas required to synthesize a viewport image according to a user's motion may be determined and patches included in the atlas may be extracted. A viewport image may be generated and rendered S506. In this case, in order to synthesize generated patches, size/position information of each patch and a camera parameter, etc. may be used.

Each of the elements constituting an input image may be classified as an entity. In an example, a different entity identifier (Entity Identifier) may be assigned to each of the objects included in an input image. Here, an object may represent an object or a person, etc. included in an input image. Alternatively, when an input image is configured with a plurality of layers, a different entity identifier may be assigned to each layer. Alternatively, after partitioning an input image into a plurality of regions, a different entity identifier may be assigned to each of the plurality of regions. Entity setting may be selectively performed according to a user's need.

An encoder/a decoder according to the present disclosure may support object-based image encoding/decoding. Object-based encoding indicates that an encoder selects an object in an input image based on an object map and partially encodes a selected object.

Each entity to which a different entity identifier is assigned in an input image may be treated as one object. Accordingly, object-based image encoding/decoding may be referred to as entity-based image encoding/decoding.

An object map may be a binary image which represents a space occupied by a specific object in an input image. In an example, a value of a pixel corresponding to a region occupied by a specific object in an input image may be set as 1 and a value of a pixel corresponding to a region unoccupied by a specific object may be set as 0.
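
A minimal sketch of separating one object from an input image with such a binary object map is given below; clearing the pixels outside the object to a fill value is an illustrative choice, not a requirement of the method.

```python
import numpy as np

def separate_object(texture: np.ndarray, depth: np.ndarray,
                    object_map: np.ndarray, fill_value: int = 0):
    """Return texture/depth copies that keep only the pixels of one object.

    `object_map` is the binary map described above (1 inside the object, 0 outside).
    """
    obj_texture = np.where(object_map[..., None] == 1, texture, fill_value)
    obj_depth = np.where(object_map == 1, depth, fill_value)
    return obj_texture, obj_depth
```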

When object-based coding is supported, each of the objects may be independently encoded/decoded. In other words, each of the objects may be generated in a separate bitstream.

FIG. 6 is a block diagram of an immersive video processing device which supports object-based encoding.

An encoder which supports object-based encoding may include an object-based atlas generation unit 620 instead of an atlas generation unit of the existing encoder.

An object-based atlas generation unit may include an object loader 621, an object separation unit 622, a pruning unit 623, an object masking and merging unit 624, an aggregation unit 625, an object clustering unit 626 and a patch packing unit 627.

An object loader 621 loads an object map. An object map may include information which identifies an object in an input image.

An object separation unit 622 separates a part corresponding to a specific object from an input image based on an object map. In an example, when object a and object b are included in an input image, a first image which includes only object a and a second image which includes only object b may be separated from the input image.

When an input image is separated into a plurality of images through an object separation unit 622, each of the separated images may be independently input to a pruning unit 623.

A pruning unit 623 may perform pruning for each separated image. In this case, pruning may be performed based on a pruning priority between input images determined in a view optimizer 110.

An object masking and merging unit 624 may generate a pruning mask for an image including a specific object through a pruning result for a separated image.

An aggregation unit 625 may combine pruning masks for an object in a unit of an intra-period.

An object clustering unit 626 extracts a patch based on a combined pruning mask image through a clustering process. Specifically, a square region including valid data in a combined pruning mask image may be extracted as a patch.

A patch packing unit 627 may generate an atlas for a specific object by packing extracted patches. Through the process, each of the objects may constitute a different atlas. In an example, only patches derived from one object may be packed to one atlas. In other words, patches to which a different entity identifier (Entity ID) is assigned may not be packed to one atlas. Accordingly, when a plurality of atlases exist, the patches packed to each atlas may be derived from a different object.

When an object-based encoding method is applied, a bitstream may be generated per object. Subsequently, a bitstream for each object may be multiplexed with metadata and transmitted to a decoder. Further, when an object-based encoding method is applied, an object map may also be encoded and transmitted to a decoder.

In a decoder, a bitstream per object may be decoded. And, based on an object map, decoded objects may be rendered in a space.

When object-based coding is applied, encoding/decoding may be performed independently and/or in parallel for each of the objects. Accordingly, only some object(s) may be selectively parsed at a bitstream level or only some object(s) may be selectively encoded/decoded.

To this end, a method of partitioning a picture into a plurality of tiles or a plurality of sub-pictures may be used. Specifically, encoding/decoding may be performed only for a tile or a sub-picture which includes an object to be encoded/decoded among a plurality of tiles or sub-pictures.

Further, for each of a plurality of tiles or sub-pictures, information for identifying an object included in a tile or a sub-picture (e.g., an object ID) may be additionally signaled. In this case, based on whether an object to be encoded/decoded is included in a tile or a sub-picture, whether the tile or the sub-picture will be explicitly encoded/decoded may be determined.
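
A small sketch of the per-tile decision described above follows; the data structures are illustrative, assuming each tile or sub-picture signals the set of object IDs it contains.

```python
def tiles_to_decode(tile_object_ids, selected_objects):
    """Return indices of tiles/sub-pictures that contain at least one selected object.

    `tile_object_ids` maps a tile (or sub-picture) index to the set of object IDs
    signaled for it; only the returned tiles need to be decoded for the selection.
    """
    return [tile for tile, ids in tile_object_ids.items() if ids & selected_objects]
```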

Meanwhile, a volumetric video filmed or generated by a 3DoF, 3DoF+ or 6DoF filming camera may be referred to as a MIV (MPEG Immersive Video) type image. On the other hand, a volumetric video generated by a method different from the above may be referred to as a non-MIV type image. In an example, a volumetric video such as a point cloud, a mesh or a multi-view video may be referred to as a non-MIV type image.

When an immersive video is configured with homogeneous MIV type images alone, there is a problem that it is difficult to independently control or utilize an attribute of an object embedded in a bitstream generated based on object-based coding.

In order to resolve the problem, a method of constituting an immersive video by combining heterogeneous type images may be considered. Specifically, when object-based coding is applied, an image for a first object in a MIV type and an image for a second object in a non-MIV type may be combined to constitute an immersive video.

To this end, in an immersive video encoding process, when an input image is a non-MIV type image, a process of converting a non-MIV type image into a MIV type image should be accompanied.

FIG. 7 represents an immersive video in which heterogeneous type videos are combined and FIG. 8 is a flow chart which represents an encoding/decoding process of the immersive video shown in FIG. 7.

In the example shown in FIG. 7, it is assumed that each of the videos for the 3 persons constituting an immersive video (MIV1, MIV2 and MIV3) and the video for a background (MIV0) is set as an independent object. In addition, it is assumed that MIV0 and MIV1 are MIV type videos and MIV2 and MIV3 are non-MIV type videos. Specifically, it is assumed that MIV2 is a video in a point cloud type and MIV3 is a video in a mesh type.

Accordingly, MIV0 and MIV1 are mutually homogeneous, but are heterogeneous with MIV2 and MIV3.

In order to encode an immersive video in which heterogeneous type videos are combined, a process of converting a non-MIV type video into a MIV type video may be accompanied.

In an example, as in the example shown in FIG. 8, MIV0 and MIV1, the MIV type videos, are directly input to an encoder, whereas MIV2 and MIV3, the non-MIV type videos, may be input to an encoder after being converted into a MIV type video.

Instead of inputting the input videos to independent encoders, after synthesizing the input videos into one video, a synthesized video may be encoded through a single encoder. Specifically, in the example shown in FIG. 8, MIV0 and MIV1, the MIV type videos, and MIV2 and MIV3, the non-MIV type videos, may be synthesized into one MIV video. Specifically, first, a plurality of videos may be synthesized into one 2D image by projecting MIV0 and MIV1, the MIV type videos, onto a 2D plane according to a predefined projection type (e.g., ERP) and converting MIV2 and MIV3, the non-MIV type (e.g., PCC and/or Mesh) videos, into a 2D image. Subsequently, the synthesized 2D video may be encoded by using a single encoder and, in addition, the synthesized 2D video may be decoded by using a single decoder.

In addition, in order to support object-based coding, an object map for a synthesized 2D video (Entity Map) may be newly generated and encoded.

As above, in order to encode an immersive video in which heterogeneous type videos are combined, an immersive video processing device may confirm the type of an input video and selectively perform conversion processing according to the confirmed video type.
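
The type check and selective conversion can be sketched as follows; the converter functions are placeholders for the point-cloud-to-MIV and mesh-to-MIV conversion described in the text, not functions of an existing library.

```python
def convert_point_cloud_to_miv(video):
    # Placeholder for the point-cloud-to-MIV conversion described in the text.
    raise NotImplementedError

def convert_mesh_to_miv(video):
    # Placeholder for the mesh-to-MIV conversion described in the text.
    raise NotImplementedError

def prepare_for_encoding(input_video):
    """Pass MIV-type inputs through to the encoder; convert non-MIV inputs first."""
    if input_video.video_type == "MIV":
        return input_video
    if input_video.video_type == "point_cloud":
        return convert_point_cloud_to_miv(input_video)
    if input_video.video_type == "mesh":
        return convert_mesh_to_miv(input_video)
    raise ValueError(f"unsupported video type: {input_video.video_type}")
```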

As a non-MIV type video is converted into a MIV type video, a converted video has a unique attribute suitable for a MIV type. As above, information representing an attribute of a converted MIV type video (e.g., content description information) may be explicitly encoded and signaled as metadata.

After performing encoding processing for each of the images input to an encoder, the encoded bitstreams may be multiplexed. A multiplexed bitstream is separated through demultiplexing and each of the separated bitstreams is input to a separate decoder. Subsequently, decoded data may be rendered according to an attribute set per object unit and/or object type.

As described above, in order to render an immersive video in which heterogeneous type videos are combined, attribute information per object unit and/or object type needs to be explicitly encoded and signaled. In an example, attribute information as above may be content description information on an encoded video or may be encoded and signaled as a kind of metadata.

Content description information may include at least one of image type information, reflection property information, dynamic information, service period information, frame rate information or atlas size information.

Video type information may represent whether a video to be encoded/decoded is a MIV type or a non-MIV type, or whether a video to be encoded/decoded is converted into a MIV type. In an example, video type information may include at least one of a flag representing whether a video to be encoded/decoded is a MIV type or an index representing a type of a video to be encoded/decoded.

A value of the flag may be determined based on whether a video input to determine whether conversion processing is needed is a MIV type. In an example, in the example shown in FIG. 7, MIV0 and MIV1 are MIV type videos, so a value of the flag may be set as 0 for the two videos. On the other hand, MIV2 and MIV3 are non-MIV type videos, so a value of the flag may be set as 1 for the two videos.

An index representing a video type may represent at least one of a MIV type, PCC, mesh or RGBD. The index may be encoded only when a value of the flag is 1. Alternatively, instead of encoding/decoding the flag, only an index may be encoded/decoded.

Reflection property information represents at least one of whether a video to be encoded/decoded has a Lambertian reflection property or whether it has a partial reflection property. The reflection property information may include at least one of a 1-bit flag representing whether it has a Lambertian reflection property and a 1-bit flag representing whether a Lambertian reflection property is partially represented. Alternatively, the reflection property information may be index information and the index information may indicate one of not having a Lambertian reflection property, having a Lambertian reflection property and having a partial Lambertian reflection property. Reflection property information may be selectively encoded/decoded only when a video type is a predefined type or one of predefined types.

Dynamic information represents whether an object corresponding to an input video is a dynamic object or a static object. The division between a dynamic object and a static object may be determined based on whether a motion of a corresponding object occurs during a predetermined period. Here, a predetermined period may represent a service period of an input video.

Service period information represents a service period of an input video. Service period information may include at least one of start point information, end point information or duration information of a service period. In this case, only when an object corresponding to an input video is a dynamic object, service period information may be encoded/decoded.

Frame rate information represents a frame rate of an input video.

Content description information may be encoded/decoded in a sequence unit of an input video.

Alternatively, for content description information, at least one of video type information, reflection property information, dynamic information, service period information or frame rate information may be encoded/decoded in a unit of a sequence, whereas the rest may be encoded/decoded in a unit of a GOP (Group of Pictures).
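
For illustration, the content description information described above could be grouped as in the following sketch, with the per-sequence and per-GOP split following the text; the names and enumeration values are assumptions, not normative syntax.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class VideoType(Enum):
    MIV = 0
    POINT_CLOUD = 1
    MESH = 2
    RGBD = 3

class ReflectionProperty(Enum):
    NON_LAMBERTIAN = 0
    LAMBERTIAN = 1
    PARTIAL_LAMBERTIAN = 2

@dataclass
class GopDescription:
    is_dynamic: bool                      # dynamic vs. static object in this GOP
    service_start: Optional[int] = None   # service period info, present for dynamic objects
    service_end: Optional[int] = None
    frame_rate: Optional[float] = None

@dataclass
class ContentDescription:
    # Signaled once per sequence
    video_type: VideoType
    reflection: ReflectionProperty
    # Signaled per GOP
    gops: List[GopDescription] = field(default_factory=list)
```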

Based on the content description information, only object(s) corresponding to a specific type of video may be partially encoded/decoded or only valid object(s) within a predetermined period may be partially encoded/decoded.

FIG. 9 illustrates an attribute value of each of input videos which may be represented by content description information.

FIG. 9(a) schematizes an attribute of each video and FIG. 9(b) schematizes a service period of each video configured based on the attribute.

In the example shown in FIG. 9(a), it is illustrated that content type information and reflection property information are encoded/decoded in a unit of a sequence and dynamic information, service period information and frame rate information are encoded/decoded in a unit of a GOP.

In the example shown in FIG. 9(a), it is illustrated that MIV0 has a non-Lambertian reflection property as a MIV type video and that MIV1 has a Lambertian reflection property as a MIV type video.

In addition, it is illustrated that MIV2 has a partial Lambertian reflection property as a video in a point cloud type and that MIV3 has a non-Lambertian reflection property as a video in a mesh type.

FIGS. 10 and 11 represent a flow chart of an encoding/decoding process of an immersive video in which heterogeneous type videos are combined according to an embodiment of the present disclosure.

An immersive video processing device checks whether an input video is a MIV type video S1010. When an input video is a MIV type, the input MIV type video is directly input to an encoder. On the other hand, when an input video is not a MIV type, the input video is converted into a MIV type S1020 and the converted MIV type video is input to an encoder.

Subsequently, an input MIV type video is encoded S1030. Encoding of a MIV type video may include a process of atlas generation and encoding and metadata encoding.

The process may be repetitively/independently performed per object.

If a plurality of bitstreams are generated through the process, the plurality of bitstreams may be multiplexed S1040.

In an immersive video output device, through demultiplexing, received data may be separated into a plurality of bitstreams S1110. And then, a separated bitstream may be decoded S1120. The above-described decoding may include decoding of image data and decoding of metadata. Further, decoding may also be performed only for some bitstreams corresponding to an object selected among a plurality of objects. In this case, when a plurality of objects are selected, decoding may be performed independently and/or in parallel for each of the plurality of bitstreams corresponding to the selected objects.
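
A sketch of this selective, parallel decoding step follows; the single-bitstream decoder function and the mapping from object IDs to bitstreams are placeholders used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_selected(bitstreams, selected_objects, decode_fn):
    """Decode only the per-object bitstreams for the selected objects, in parallel.

    `bitstreams` is assumed to map an object ID to the bitstream carrying that
    object's data; `decode_fn` is a placeholder for a single-bitstream decoder.
    """
    targets = {obj: bs for obj, bs in bitstreams.items() if obj in selected_objects}
    with ThreadPoolExecutor() as pool:
        futures = {obj: pool.submit(decode_fn, bs) for obj, bs in targets.items()}
    return {obj: fut.result() for obj, fut in futures.items()}
```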

Subsequently, a decoded image may be rendered by using the generated metadata S1130. Through this, objects generated from heterogeneous type videos may constitute one scene.

A name of syntax elements introduced in the above-described embodiments is just temporarily given to describe embodiments according to the present disclosure. Syntax elements may be named differently from what was proposed in the present disclosure.

In the above-described embodiments, methods are described based on a flow chart as a series of steps or units, but the present disclosure is not limited to the order of steps and some steps may occur simultaneously or in an order different from other steps described above. In addition, those skilled in the pertinent art may understand that the steps shown in a flow chart are not exclusive and other steps may be included, or one or more steps in a flow chart may be deleted without affecting the scope of the present disclosure.

The above-described embodiments include examples of various aspects. All possible combinations for representing the various aspects may not be described, but those skilled in the pertinent art may recognize that other combinations are possible. Accordingly, it may be said that the present disclosure includes all other replacements, modifications and changes which fall within the following scope of patent claims.

Embodiments according to the present disclosure described above may be recorded in a computer readable recording medium by being implemented in a form of program instructions which may be performed through a variety of computer components. The computer readable recording medium may include a program instruction, a data file, a data structure, etc. alone or in combination. A program instruction recorded in the computer readable recording medium may be specially designed and configured for the present disclosure or may be available as known to those skilled in the computer software art. Examples of a computer readable recording medium include magnetic media such as a hard disk, a floppy disk and a magnetic tape, optical recording media such as a CD-ROM and a DVD, magneto-optical media such as a floptical disk, and hardware devices which are specially configured to store and execute a program instruction, such as a ROM, a RAM, a flash memory, etc. Examples of a program instruction include not only a machine language code generated by a compiler, but also a high-level language code which may be executed by a computer with an interpreter, etc. The hardware device may be configured to operate as one or more software modules to perform processing according to the present disclosure and vice versa.

As above, the present disclosure has been described with limited embodiments and drawings and with specific matters such as specific components, etc., but these are just provided to help a more general understanding of the present disclosure, and the present disclosure is not limited to the embodiments, and those skilled in the pertinent art may make a variety of modifications and variations from such a description.

Accordingly, the idea of the present disclosure should not be limited to the above-described embodiments, and all modifications equal or equivalent to the scope of the patent claims described below, as well as the scope of those patent claims, fall within the scope of the idea of the present disclosure.

1. A method of encoding an immersive video, the method comprising: determining whether an input image is a first type; converting the input image into the first type when the input image is a second type different from the first type; encoding a converted image; and generating metadata for the encoded image.

2. The method of claim 1, wherein the metadata includes video type information for the encoded image.

3. The method of claim 1, wherein: the encoded image is an image for a predetermined object, and the metadata includes dynamic information representing a dynamic characteristic of the object.

4. The method of claim 3, wherein the dynamic information indicates whether the object is in a static state or in a dynamic state within a predetermined period.

5. The method of claim 4, wherein: the predetermined period is a service period of the encoded image, and the metadata further includes duration information representing the service period.

6. The method of claim 4, wherein: the predetermined period is represented in a unit of GOP (Group of Pictures), and the dynamic information is encoded per the predetermined period within a service period of the encoded image.

7. A method of decoding an immersive video, the method comprising: obtaining a plurality of bitstreams through demultiplexing; decoding at least one of the plurality of bitstreams; and rendering an immersive video based on a decoded image and decoded metadata, wherein the metadata includes video type information representing whether the decoded image is a first type.

8. The method of claim 7, wherein: the decoded image is an image for a predetermined object, and the metadata includes dynamic information representing a dynamic characteristic of the object.

9. The method of claim 8, wherein the dynamic information indicates whether the object is in a static state or in a dynamic state within a predetermined period.

10. The method of claim 9, wherein: the predetermined period is a service period of the decoded image, and the metadata further includes duration information representing the service period.

11. The method of claim 9, wherein: the predetermined period is represented in a unit of GOP (Group of Pictures), and the dynamic information is decoded per the predetermined period within a service period of the decoded image.

12. The method of claim 7, wherein: each of the plurality of bitstreams includes data for a different object, and the decoding is performed only for at least one bitstream corresponding to an object selected among a plurality of objects.